DOCUMENT RESUME

ED 464 105                                              TM 033 786

AUTHOR        Scarloss, Beth
TITLE         Assessing Complex Academic Performance at the Group Level.
PUB DATE      2002-04-00
NOTE          57p.; Paper presented at the Annual Meeting of the American
              Educational Research Association (New Orleans, LA, April
              1-5, 2002).
PUB TYPE      Reports - Research (143) -- Speeches/Meeting Papers (150) --
              Tests/Questionnaires (160)
EDRS PRICE    MF01/PC03 Plus Postage.
DESCRIPTORS   *Academic Achievement; *Elementary School Students;
              *Evaluation Methods; *Grade 6; Group Activities; *Groups;
              Intermediate Grades; *Student Evaluation
IDENTIFIERS   *Complex Instruction



ABSTRACT
The purpose of the original Program for Complex Instruction (PCI) study was
to investigate the effect on learning gains of having students know
the content and performance standards on which they will be judged as well as
the effect of using evaluation criteria. This study looks at the 39 student
groups, a total of 163 sixth graders, involved in the PCI study. The focus
was on whether group performance is a valid measure of academic performance
at the group level. The groups were heterogeneously composed on the bases of
gender, ethnicity, and academic achievement, and they remained stable
throughout the course of the focal unit. Groups completed 5 instructional
activities in 5 days. All of the teachers were skilled Complex Instruction
teachers who had worked with PCI in the past. Regression analysis and
correlational findings show that group performance scores are a valid measure
of academic performance at the group level. Data show that the group measure
is as fair a measure of academic performance as aggregating individual
performance. The evidence confirms that group level analysis can be done
successfully for conceptual academic content. Attachments contain scoring
rubrics and activity sheets for two group activities. (SLD)

Problem

Can we reliably learn about students' academic mastery when measuring 
performance at the group level? The educational literature is largely silent on 
this question. One area where the assessment literature is increasingly clear is on 
the necessity of matching assessment methods with the classroom context 
(Sheppard, 2000; Stiggins, 2001). But what about when that context includes 
group activities? Most authors do not consider the possibility that classroom 
assessment might include anything other than individual performance (cf.
Cohen, 1997). Even within a larger work on learning as part of social
interaction, researchers can maintain an individual focus when it comes to
assessment (cf. Newman, Griffin, and Cole, 1989). To date, group measures are
made by aggregating performance on individual measures. This paper examines
the assessment of academic performance at the group level.

Two different questions arise from the attempt to assess academic
performance at the group level: "How would you do it?" and "Why not just 
aggregate individual scores?" 

Stated more formally, the first question can be asked: How can one make a 
valid measure of academic content knowledge of a group? I argue that three 
features are necessary: the assessment must be specific to the academic content of 
the activity, the assessment criteria must reflect the intrinsic characteristics of the 
medium called for (e.g., poster, skit), and consistent judgments are necessary to 
assure equitable assessments. 

Developing this measure had many complexities. Deciding on a method 
to evaluate group performance was the first necessary step. Rubrics break a 
whole performance into its constituent parts and explicitly state expectations for 
the listed levels of performance. This type of scoring system offered the greatest 
flexibility in organizing this measure and so is used to judge group academic 
performance. The type of rubric used as an assessment tool, the content to be 
assessed, and the medium through which that content is to be expressed are all 
factors to be considered in developing this assessment.

Rubrics for the assessment of group work tend to focus on procedure
rather than content (see Solomon, 1998, for an excellent overview; Webb, 1995). A
literature search turned up no rubrics centered on features of the academic
content of the group product. Rubrics generally quantify the amount of work on
a group product or the contributions of individual group members as a
proportion of the whole. For example, a rubric might measure the amount of
history (math, science) that a group accomplishes without examining the
historical (mathematical, scientific) qualities of the work.

For each of the five activities in this unit, I designed a rubric to evaluate 
the group's work using the historical content specific to that activity. Each activity in 
the unit made use of a different medium for the academic performance; these 
media fell into two categories, production and performance. Balancing type, 
content, and medium is an intricate operation for which I found little guidance. 

In part, the preceding discussion answers my second question: Why not 
just aggregate individual scores? Individual scores are not feasible for group 
products that are intrinsically not individual endeavors, such as a skit. Perhaps 
the question is better stated: How do group performance and aggregated 
individual performance compare as indicators of academic productivity at the 
group level? To answer that question I explore three measures of academic 
accomplishment: group product/presentation, individual essay test
performance, and individual multiple-choice test performance, the latter two 
aggregated to the group level. 

Methodology 

Design 

This study is a secondary analysis of data collected by staff of the Program 
for Complex Instruction (PCI). Their purpose was to investigate the effect on 
learning gains of having students know the content and performance standards 
on which they will be judged as well as the effect of using evaluation criteria
(Abram et al., 2000). The PCI design, using Campbell and Stanley's (1963)
terminology, was a quasi-experimental non-equivalent control group design. 

This study looks at the 39 student groups involved in the PCI study. The 
groups were heterogeneously composed on the bases of gender, ethnicity and 
academic achievement and remained stable throughout the course of the focal 
unit. Groups completed five different activities in five consecutive days, though
in varying order. Data were scored for the first, third, and fifth days of the unit. 
Various groups were recorded doing the full range of tasks on each day scored. 
Audio tapes of group presentations, photos of the groups in action, and group 
products were used to generate the group performance scores. Group 
performance was scored by the author and one other scorer. Agreement was 
established separately for rubrics on each of the 5 activities. Scorers reached 
greater than 90% agreement on all of the rubrics. 

Setting and Sample 

Thirty-nine student groups from five sixth-grade classes (N=163), drawn
from a multiracial, multiethnic, and largely poor sector of California's Central 
Valley, participated in the study during the 1998-1999 school year. The average 
national percentile ranking on the SAT-9 standardized reading test for students 
in the sample was 34.6. Approximately 25% of students in the study were 
designated limited English proficient. Many students reported either Spanish or 
Punjabi as their first language. As is common in many of the communities in 
California's Central Valley, many local residents are immigrants or migrant 
agricultural workers. 

In each of the five classrooms, students completed the same four 
instructional units based on the Complex Instruction (CI) model of cooperative
learning (cf. Cohen and Lotan, 1997b). Three classes implemented Complex
Instruction units with evaluation criteria and two classes implemented identical
units except for the absence of evaluation criteria. All students enrolled in the
participating teachers' classrooms were studied. In all cases, the unit was part of
the teacher's regular curriculum. Students practiced group skills using 
"skillbuilder" exercises prior to the implementation of the units. These 
skillbuilders provided students with guidelines and practice on how to hold 
academic discussions. In classrooms using the evaluation criteria, the skillbuilder 
focused on talk using the evaluation criteria. In the comparison classrooms, the 
skillbuilder was designed to improve the skills necessary for high-quality group 
discussion. 

All classes completed three preliminary Complex Instruction units to 
acquaint students with group activities, roles, and norms, and to familiarize the 
teachers and students with data collection procedures and instruments. Data 
used in this study were collected during the fourth and final CI unit, "The
Importance of the Afterlife in Ancient Egypt." 

Teachers participating in the study were all skilled Complex Instruction 
teachers who had worked with PCI in the past. Each had completed a 10-week
course on Complex Instruction at California State University at Stanislaus in
either 1994 or 1995. At the completion of the course, each teacher participated in
a year-long follow-up and feedback program at their school site, which included
at least nine classroom visits by their CI trainer. All of the teachers have made CI
units part of their regular curriculum in each school year since their training.
Three of the five teachers returned to CSU Stanislaus for advanced work on
training other teachers in CI; four of the five did advanced work on curriculum
development. 

PCI staff selected participating teachers on the following criteria: 1) 
effective classroom management skills; 2) solid social studies content knowledge 
and understanding of the curriculum; and 3) successful prior implementation of 
Complex Instruction. Teachers participating in this study taught at year-round 
schools. While units were taught at different points in the calendar year, each 
unit was taught at approximately the same point in the teacher's academic year.

Rubrics



Before detailing the procedures of rubric development, I outline the 
assumptions that underlie this work and infuse the rubrics. I maintain that to 
best evaluate academic performance, three features are necessary. First, the 
assessment must be specific to the academic content of the activity. Second, the 
assessment criteria must reflect the intrinsic characteristics of the medium called 
for (e.g., poster, skit). Finally, consistency among rubrics is necessary to assure 
equitable assessments. Consistency may depend on similar bases for judgment or 
on ensuring that the magnitude of a given element affects outcomes to a similar 
degree across rubrics. 

Two types of rubrics are used here (Solomon 1998). Developmental 
rubrics use substantive differences in product quality as the distinction between 
levels. Task-specific rubrics measure the magnitude of a given characteristic 
(none, few, some, lots). 



Curricula 

In addition to selecting an evaluation tool, another complexity in assessing 
group performance is the organization of the curriculum. Complex Instruction
(CI) curricula, organized around "big ideas" central to the discipline, include
both specific factual content and broad conceptual content. CI units also include
a performance component requiring groups to display their command of the 
academic content. A summary of the unit, "The Importance of the Afterlife in 
Ancient Egypt," is given in Table 1. The table shows the concrete and conceptual 
academic content and the performance component for each of the activities in the 
unit. These activities require groups to embed the concrete academic content 
within a specific context. The "facts" are applied while exploring conceptual 
content. The performance component of this curriculum requires that groups 
make or do something using the academic information. Further, the task often 
requires groups to make a presentation to the class, explaining to others what 
they have done.

Before describing the methods used to assess groups, let me define terms
used in this discussion. "Concrete content" is the term I use to describe academic 
content that has a simple right/wrong aspect. Facts are concrete content as are 
simple concepts such as "a pharaoh is like a king". "The Importance of the 
Afterlife in Ancient Egypt" included concrete content such as the organization of 
a typical tomb (Activity 3) or the steps in the mummification process (Activity 4). 

"Conceptual content" is used to refer to academic concepts in the unit. 
Concepts can be defined as a combination of ideas that reveal general classes of 
things, behaviors, organizational patterns, etc. Concepts can sound simple (e.g., 
tombs were considered houses for the afterlife) and yet carry large numbers of 
implications and assumptions with them (e.g., houses assume a lifestyle bringing 
issues of decor, servants, comfortable furniture, etc.). As its name suggests, the 
unit studied focused on the concept of how ideas about the afterlife affected the 
way ancient Egyptians lived their everyday lives. Each activity featured one 
aspect of that very broad concept. For example, in Activity 4, groups explore 
how the preservation of the body through mummification allowed the deceased 
to "live" in the afterlife, as he or she lived before death. 

Breaking down the historical content of the unit in another way, the 
various activities are different representations of the same overarching concept. 
Eisner (1994) argues that multiple representations of the same concept, using a 
variety of media, open generally untapped avenues of access to the academic
content for students. The tasks in this unit were specifically designed to tie the
medium of a task to the content featured in that activity. For example, in Activity
3, groups learn about tomb design by designing a tomb. In addition to giving
students uncommon access to the content, each activity is designed to require a
variety of intellectual abilities. This expanded range of intellectual abilities gives more
students access to the curriculum (Lotan, 1997a). Lotan maintains that making a
concrete product that is closely tied to academic content can be useful for
enhancing academic writing (personal communication, November 2000).

Figure 1: Elements in group performance, components to be assessed, and rubric
type used for that assessment

[Figure not reproduced; surviving labels: Presentation; concrete content; Developmental]

Figure 1 shows the different elements of group performance in CI
curricula, the performance components to be assessed, and the type of rubric 
used to assess those elements in the current work. Different types of rubrics are 
suited to different components of academic performance. In evaluating groups' 
work, it is necessary to match the assessment to the form of expression called for. 
It is also important to address the given context in assessing academic content. 
Further, it is necessary to assess the use and sophistication of certain conventions 
of presentation. I chose rubrics that best match the purpose of the assessment. I 
found that two distinct types of rubrics mentioned above, developmental and 
task-specific rubrics, were best suited for this task. 

The unit "The Importance of the Afterlife in Ancient Egypt" includes five
different activities, each centered on a specific aspect of the unit's big idea. Each 
of the five activities has three components (concrete content, conceptual content, 
and presentation conventions) to be judged for two separate elements (product 
and presentation). Separate rubrics were developed for each of these parts, 
making a total of 30 rubrics. 
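
The count of 30 follows from crossing the three sets named above. A minimal sketch of that arithmetic, assuming the activities are simply numbered 1 through 5; the variable and function names are illustrative and do not come from the study:

```python
from itertools import product

# Labels are taken from the text; numbering the activities 1-5 and the
# tuple layout are assumptions made for illustration only.
ACTIVITIES = (1, 2, 3, 4, 5)
ELEMENTS = ("product", "presentation")
COMPONENTS = ("concrete content", "conceptual content", "presentation conventions")


def enumerate_rubrics():
    """Return one (activity, element, component) triple per rubric."""
    return list(product(ACTIVITIES, ELEMENTS, COMPONENTS))


rubrics = enumerate_rubrics()
print(len(rubrics))  # 5 activities x 2 elements x 3 components
```

Listing the triples this way also makes it easy to check that no activity, element, or component has been left without a rubric.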

Table 1: Performance Summary for Group Activities in "The Importance of the Afterlife in Ancient Egypt"
Unit big idea: The Importance of the Afterlife in Ancient Egypt - Effects of beliefs about afterlife on the living



Assessing Content 



Developmental Rubrics 

Distinguishing Academic Content. I use Activity 3: "Tombs - Houses of 
Eternity" and Activity 4: "I Want My Mummy" throughout this section as 
examples of the elements of evaluation. The first of these activities requires both
a product and a presentation; the second requires a performance. The activities use
different media: a design or model of a tomb versus a song, chant, or dance; they
have different content, one featuring tomb design and the other the
mummification process. The activities focus on different parts of the big idea: the
preparation of a physical home and the idea that the afterlife is incarnate for both
body and spirit. Both address the central concepts of the unit: the importance of 
the afterlife in ancient Egypt, and the effects of beliefs about the afterlife on the 
living. 

As indicated in Table 1, Activity 3, "Tombs - Houses for Eternity," 
includes as concrete content how tombs were designed and made. The 
conceptual content is the idea that the tomb serves as a home for the deceased's 
next life. Groups are asked to make a design or 3-D model of a tomb. Activity 4, 
"I Want My Mummy," includes as concrete content the stages of the 
mummification process. Conceptually, groups explore ideas of how the body is 
used to live in the afterlife. As their "product," groups perform a song, rap, or 
dance. 

Rubrics for assessing concrete content explicitly call for groups to include 
specific facts or ideas. The concrete content rubric for Activity 3 states,
"Depiction [is] clearly monument or hidden type of tomb"; the rubric for
Activity 4 states, "Song, rap, or dance addresses 5 or more major elements [of the
mummification process] giving details of each step."

Rubrics for conceptual content require that the ideas explored be placed in 
their historical context by stating the expected application of the concepts of the 
activity. The conceptual content rubric for Activity 3 states, "Depiction is
consistent with ancient Egyptian tomb design... tomb protects occupant's goods
in a manner consistent with ancient Egyptian tomb design." The rubric for
Activity 4 requires that the "Song, rap, or dance makes [a] link between
mummification and [a] specific spiritual element."

All rubrics assess content as it is embedded in the specific historical 
context and as it is applied to the given situation. Each application of the 
academic content is centered on the big idea of the importance of the afterlife to 
ancient Egyptians. 

Consistency Across Activities. Consistency was mentioned in the previous 
section as a necessary element for a set of rubrics. I have just outlined the
techniques I used to ensure one kind of consistency, a similar basis for
judgment. I now discuss the steps taken to assure similar increments between
scoring levels across the different rubrics.



Table 2: Distinctions among scoring values for content-based, developmental
rubrics

Concrete Content                     Score   Conceptual Content
Minimal or missing                   1       Not present
Applied but with elements            2       Incomplete or inconsistent
missing or wrong
Applied with reasoning included      3       Ideas consistent with ancient
                                             Egyptian beliefs, but implicit
Applied with included reasoning;     4       Ideas consistent with ancient
complete, coherent, exemplary                Egyptian beliefs, and explicit



Table 2 shows the distinctions between levels used for concrete content 
and conceptual content rubrics. Performance extremes were easiest to identify, as 
they defined the first and fourth categories for the rubrics. In a number of cases, 
the content of the activity was simply "not there" in the group product or
performance. For example, a group assigned to make a song, rap, or dance for
Activity 4 sang the lyrics: 

"The king is dead 
He died in his bed 
Before he was wed." 

None of the processes of mummification are present. This example 
typifies group performance given a score of "1." Compare the song above to the 
following response to the same assignment. The lyrics below are reprinted 
exactly as they appear on the students' lyric sheet used in the group 
performance. 

[Sung to the tune of Queen's "We Will Rock You"] 



(Chorus) 

We will we will mummify you 

We will we will mummify you 

In the beginning will take out your brains 
your heart and all you orgains all over the place you've 
got salt on your Face from preserving you and leting 
you dry out for at least 40 days. 

We will we will mummify you 

We will we will mummify you 

We put pads under your eye's and wax in your nose 
and rap you with linen for your clothes. 

We will we will mummify you 

We will we will mummify you 



The Ba and Ka will recognize you because of your 
mask and you will be juged because of your past. 

We will We will mummify you 
We will we will mummify you 

We'll put you in a coffin or maybe 2 or 3 then
we decorate you with all kinds of jewelry

We will we will mummify you 
We will we will mummify you 
We will we will mummify you 






This song thoroughly covers the concrete content of the activity and 
explicitly ties mummification procedures (the use of a mask) to the needs of the 
Ba and Ka (to reunite body and spirit). This song is an example of work given a 
score of "4." 

Distinctions between the two middle performance levels were not as clear. 
Table 2 gives the distinctions that typified the various levels. Moving from one 
extreme to the other, it became apparent that some groups were earnest in their 
attempts to do the work, but lacking in mastery of the content. Groups like this 
might include mummification procedures but get them wrong (e.g., "the
mummification process takes 2 weeks"), misunderstand an aspect of the process 
(e.g., "put him in the coffin and then apply salt"), or leave out large portions of 
the process (e.g., "take out the organs then put him in the sarcophagus"). 
Products that did not communicate mastery of the material, but that did show an 
incomplete command of the material were given a score of "2." Other groups 
demonstrated a sufficient command of the material but either did not have the
sophistication of exemplary work or lacked the understanding that they had
completed the assignment. Such groups did the assignment but did not 
recognize they had done it completely or well. Products that communicated 
mastery of the material, but had small gaps in their understanding were given a 
score of "3." 

In summary, academic content was categorized as either concrete or 
conceptual and was scored using developmental rubrics. Distinctions between 
scoring levels were consistent within a category and as similar as possible across 
rubrics. The big idea was woven throughout the rubrics echoing the same ideas 
for all of the activities. 

Characteristics of the Medium. A third complexity of scoring group
performance is the range of media called for in the unit. Each activity used a 
different medium; each medium had its own characteristics. I stress the 
importance of matching assessment criteria to the medium called for in the 
activity. Another factor in assessing group performance is that,
while academic content is contained within the unit,
conventions for presenting that information using a given medium are not.
Students are supposed to learn about ancient Egypt in the unit; there is no
provision for students to learn songwriting skills. Students may bring skills with 
the various media to the task or they may develop the skills as the unit 
progresses and they observe their peers and receive feedback from their teacher. 
However, development of these skills is not the academic goal of the unit. 

Throughout the scoring of group performance, I attempt to minimize the 
effect of skills in a particular medium on judging content. I acknowledge the 
importance of the match between the medium and the content expressed as I also 
recognize the importance of the pre-existing skill sets that students bring to the 
group task. Such concerns led to the decision to use task-specific rubrics for 
judging presentation conventions, rather than the developmental rubrics used 
for the content assessments. 



Task-Specific Rubrics 

Assessing the various media raised the question "To what extent is this
product a good example of what it is supposed to be?" Is this model a good 
model? Does this song exemplify what a song should be? Two types of criteria 
emerged in judging presentation conventions. First, I looked for the presence of 
elements intrinsic to the medium. Second, I looked at the sophistication of the 
use of those elements. 

For example, the design for a tomb intrinsically requires a floor plan and
a setting, among other things. A sophisticated tomb design might include a
mummy in the burial chamber and tomb paintings on the walls. A song about
the mummification process can be expected to have rhythm and, maybe, rhyme.
A sophisticated mummification song might include a chorus or harmony.
Rubrics for assessing presentation conventions judged specific elements either as
"present or absent" or on a scale of "poor/fair/good."

It should also be noted that some of the conventions for making a 
presentation to the class are consistent across all activities (e.g., speaking loudly 
enough to be heard). Rubrics for the different activities included a section on 
presentation conventions. This rated group performance as "formulaic," 
"mixed," "adequate," or "fluent" on a range of presentation skills including 
"Topics presented in an orderly manner; transitions made between topics" 
(Activity 3) and "Clear separation made between song, rap, or dance and 
remainder of the presentation" (Activity 4). 

Based on my review of the literature, task-specific rubrics are far more 
common than are developmental rubrics. No doubt this occurs because of the 
relative ease of generating a relative scale (none, few, some, lots) as compared to 
the difficulty of specifying group performance with distinct differences based on 
academic content. 



Scoring 

Several types of data were used to score group performance. Audio tapes 
of groups making their presentations were one of the primary sources of data. 
Group presentations were recorded with a tape recorder placed near the 
presenting groups. A recording of the teacher was made at the same time. 
Occasionally, the teacher's tape was used to clarify speech recorded on the group 
performance tape. 

The group product itself was another main source of data. Wherever 
possible, group products were collected by the research staff. Where the group 
product could not be stored, at least one, and usually several, photographs
showed the item. Any other available data were used in the scoring where
appropriate. For instance, in many cases scripts and props were collected. 

Photographs were also taken of the groups, standing before the class, 
making their presentations. In these photos the scorer was able to see the product 
displayed and to see the costumes or props as well as the placement of actors. 

Group performance was scored by the author and one other staff member. 
Scoring rubrics were compiled for each activity using the three sections 
discussed above, concrete content, conceptual content, and presentation 
conventions. Agreement was separately established for each of the 5 activity 
rubrics. Scorers reached greater than 90% agreement for all of the rubrics. 
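
The text does not say how agreement was calculated. One common reading is simple percent agreement, the share of performances on which both scorers assigned the same rubric level. The sketch below assumes that reading; the function name and the scores are hypothetical and are not data from the study.

```python
def percent_agreement(scores_a, scores_b):
    """Percentage of items on which two scorers assigned the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return 100.0 * matches / len(scores_a)


# Hypothetical 4-point rubric scores from two scorers for twenty performances.
scorer_1 = [1, 2, 2, 3, 4, 3, 2, 1, 4, 3, 2, 2, 3, 1, 4, 3, 2, 3, 1, 2]
scorer_2 = [1, 2, 2, 3, 4, 3, 2, 2, 4, 3, 2, 2, 3, 1, 4, 3, 2, 3, 1, 2]

agreement = percent_agreement(scorer_1, scorer_2)
print(agreement)
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected statistic would be a different technique from the one sketched here.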

Each rubric described performance using a 4-point scale for the categories 
of concrete content, conceptual content, and presentation conventions. 
Preliminary analyses indicated a high degree of collinearity among these
measures. Measures were indexed for the two aspects of performance, product
and presentation, averaging scores to maintain the 4-point scale. Again, the
measures were strongly correlated (r = .80, p < .00). The final group performance
measure adds product and presentation scores and averages them across the 
three rotations, preserving the 4-point scale. 
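
The indexing steps described above can be sketched as follows. This is one reading of the procedure, not a record of the actual computation: it assumes that "adds ... preserving the 4-point scale" means the product and presentation indexes are averaged, and that a day with only one element (a performance-only activity) is scored on that element alone. All names and scores are hypothetical.

```python
from statistics import mean

ELEMENTS = ("product", "presentation")


def element_index(component_scores):
    """Average the component scores (each 1-4) for one element of performance."""
    return mean(component_scores)


def day_score(day):
    """Average the element indexes available for one scored day."""
    return mean(element_index(day[name]) for name in ELEMENTS if name in day)


def group_performance(days):
    """Average the daily scores across the scored rotations."""
    return mean(day_score(day) for day in days)


# Hypothetical component scores, ordered as
# (concrete content, conceptual content, presentation conventions).
example_days = [
    {"product": (2, 2, 3), "presentation": (3, 3, 2)},
    {"product": (3, 3, 3), "presentation": (2, 2, 2)},
    {"presentation": (4, 3, 2)},  # assumed performance-only activity
]

score = group_performance(example_days)
print(round(score, 2))
```

Because every step is an average of values between 1 and 4, the final measure stays on the same 4-point scale as the individual rubric scores.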

Results 

I began by asking how one can make a group performance measure. The
previous sections lay out how it can be done. I now turn to the question of to what
effect group performance measures can be used.

How do group performance and aggregated individual performance 
compare as indicators of academic productivity at the group level? To answer 
that question I explore the three available measures of academic 
accomplishment: average group performance, individual essay performance 
(essay), and individual multiple-choice post-test performance (test). Table 3 gives 
descriptive statistics for the variables. Average group performance is reported as 
a grand mean of group product and presentation scores (on a scale of 1-4)
averaged for the first, middle, and last days of the unit. Essay performance is
reported as the aggregate of group members' individual scores on two aspects of 
essay writing: factual content and conceptual content. 1 This variable is measured 
on a scale of 2-8. Post-test performance measures students on the same 
30-question, content-referenced, multiple-choice test that was used before 
instruction began. Scores are reported as the percentage of correct answers. All of 
the variables reported are normally distributed. On all three measures, the 
students in this sample "topped out" well below the maximum performance 
possible. 
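
A minimal sketch of how the two individual measures could be aggregated to the group level. It assumes that each essay aspect is scored 1-4 (which yields the stated 2-8 range when summed) and that "aggregate" means the mean across group members; neither detail is spelled out in the text, and the scores shown are hypothetical.

```python
from statistics import mean

N_TEST_ITEMS = 30  # length of the multiple-choice test reported in the text


def essay_score(factual, conceptual):
    """Sum of the two content aspects; 2-8 if each aspect is scored 1-4."""
    return factual + conceptual


def group_essay_score(members):
    """Mean essay score across members, given (factual, conceptual) pairs."""
    return mean([essay_score(f, c) for f, c in members])


def group_posttest_percent(correct_counts, n_items=N_TEST_ITEMS):
    """Mean percentage of correct answers across group members."""
    return mean([100.0 * n / n_items for n in correct_counts])


# Hypothetical four-member group.
essay_pairs = [(2, 1), (2, 2), (3, 2), (1, 1)]
correct_counts = [18, 21, 15, 24]

print(group_essay_score(essay_pairs))
print(group_posttest_percent(correct_counts))
```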



Table 3: Descriptive statistics for group and aggregated individual performance
measures (N = 39)

                             Mean   Median   Standard    Min.   Max.
                                             Deviation
Average Group Performance    2.3    2.2      0.60        1.1    3.6
Aggregated Essay Score       3.6    3.6      0.84        2      5
Aggregated Post-Test %       62     63       8.38        47     76



Correlations among the performance variables are given in Table 4. The
group performance measure correlates with both essay and test scores (r = .52,
p < .00 and r = .32, p = .04, respectively). This result indicates that the group
performance measure records aspects of performance similar to those recorded by
both essays and tests. The measure of essay performance and the measure of test
performance are not significantly correlated (r = .26, p = .11). Such a result
implies that the indicators do not measure the
same aspects of performance. One might conclude that the group performance
measure taps aspects of academic performance as measured by both essays and
tests, though those measures are exclusive of one another.
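
The reported significance levels can be checked by hand from the correlations and the number of groups, using the standard t test for a Pearson correlation, t = r * sqrt((N - 2) / (1 - r^2)). The sketch below is such a check, not the analysis actually run; the critical value is an approximate table value for 37 degrees of freedom and is an assumption of the sketch.

```python
import math


def t_from_r(r, n):
    """t statistic for testing a Pearson correlation r from n pairs against zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


N_GROUPS = 39
# Approximate two-tailed critical value of t at alpha = .05 with 37 degrees
# of freedom, read from a standard t table (assumed, not from the paper).
T_CRITICAL = 2.026

reported = {
    "group vs. essay": 0.52,
    "group vs. test": 0.32,
    "essay vs. test": 0.26,
}

for label, r in reported.items():
    t = t_from_r(r, N_GROUPS)
    verdict = "significant" if abs(t) > T_CRITICAL else "not significant"
    print(label, round(t, 2), verdict)
```

Under these assumptions the two correlations involving group performance exceed the critical value and the essay-test correlation does not, which matches the pattern of significance levels given in Table 4.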



1 The other two elements scored were Organization and Mechanics. As those two aspects of essay 
writing are more likely to be tied to skills than knowledge, I do not use them here. 
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Table 4: Correlations (with significance levels) among group and aggregated 
individual performance measures (N = 39) 



                                   Group          Essay          Post-Test
                                   Performance    Performance    Percentage
Average Group Performance           1.0
Aggregated Essay Performance        .52            1.0
                                   (.00)
Aggregated Post-Test Percentage     .32            .26            1.0
                                   (.04)          (.11)





I began this investigation with the assumption that groups are "more than
the sum of their parts." I was reminded that current thinking does not accept that
a group is more than its parts, but sees a group as intrinsically different from the
sum of its parts (McDermott, personal communication, November 2000). Taking
a group as "more than its parts" assumes that a group can be described by the
contributions of its members as individuals, plus some ineffable something
whereby a given individual may transcend what she would have been able to
accomplish working alone; the mixture of individuals forms a whole without
coherence. The idea that a group differs from the sum of its parts assumes that,
once formed, a group is a unique and coherent entity. Current thinking holds
that a comparison between the two is a juxtaposition of unlike objects. Findings in
this work support the latter conception of groups.

It can also be argued that creating a good product prepares group 
members for writing their essays and taking the test. 2 Essay and test
performance come later in time than the making and presentation of group products.
One would expect that activities during the course of the unit would contribute to 
students' performance on assessments following the unit. Indeed, Lotan asserts 

2 I am indebted to E. G. Cohen for her help with this point.
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that engaging in discussions and manipulating activity materials that are closely 
tied to the unit content are an excellent preparation for academic writing (Lotan, 
personal communication, November 2000). 

In the PCI study, essay tests were given after the other performance 
measures were collected. Rather than calling for recognition of the correct 
answer, as in multiple-choice tests, essay tests require students to recall 
information, analyze, compose, and muster arguments, to name a few skills. Because
of the timing of the essay test and the qualities of academic performance it 
measures, I use essay scores as the outcome variable in comparing group 
performance measures. That is, I regress essay scores aggregated to the group 
level on group performance scores and multiple-choice post-test scores 
aggregated to the group level, controlling for reading scores. Reading percentile 
is included because of its heuristic interest and robust predictive performance in 
other studies. 



Table 5: Standardized coefficients for essay performance regressed on group
performance and aggregated individual measures; Dependent variable: Essay
content scores aggregated to the group level (N = 39)

Predictors                        Beta (β)    t       Probability Level (a)   Tolerance
Aggregated Post-test Percent        .08       0.52          .60                 .87
Aggregated Reading Percentile       .15       1.03          .31                 .95
Average Group Performance           .47       3.10          .00                 .88
Model: Adj. R² = .24, F = 5.0, p < .01 (a)

(a) p-values are reported as two-tailed tests



Table 5 shows that group performance predicts essay performance
(β = .47, p < .01) while multiple-choice post-test performance does not (β = .08,
p = .60). In past research, reading ability, as measured by standardized tests, has
been a robust predictor of academic performance. In this case it is not (β = .15,
p = .31). While these results could be indicative of faulty measurement, the
relative magnitudes of the standardized coefficients (beta weights) appear to
indicate that group performance is a more proximate measure in predicting essay
performance.
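
To see how beta weights of this kind arise from the correlations in Table 4, the sketch below solves a simplified two-predictor version of the model directly from those correlations. It is an illustration under stated assumptions, not a reproduction of the analysis: it omits the reading percentile control (its correlations are not reported here), so the numbers will differ somewhat from Table 5, and the function name `two_predictor_betas` is hypothetical.

```python
def two_predictor_betas(r_y1: float, r_y2: float, r_12: float):
    """Standardized coefficients for y regressed on x1 and x2.

    r_y1, r_y2: correlations of the outcome with each predictor.
    r_12: correlation between the two predictors.
    """
    tolerance = 1.0 - r_12 ** 2               # predictor variance not shared with the other
    beta_1 = (r_y1 - r_y2 * r_12) / tolerance
    beta_2 = (r_y2 - r_y1 * r_12) / tolerance
    r_squared = beta_1 * r_y1 + beta_2 * r_y2  # variance in y explained by both
    return beta_1, beta_2, tolerance, r_squared

# Correlations taken from Table 4 (outcome: aggregated essay performance).
R_ESSAY_GROUP = 0.52
R_ESSAY_POST = 0.26
R_GROUP_POST = 0.32

beta_group, beta_post, tol, r2 = two_predictor_betas(
    R_ESSAY_GROUP, R_ESSAY_POST, R_GROUP_POST
)
print(f"beta (group performance) = {beta_group:.2f}")
print(f"beta (post-test)         = {beta_post:.2f}")
print(f"tolerance                = {tol:.2f}")
print(f"R squared                = {r2:.2f}")
```

Even without the reading control, the simplified model gives group performance a coefficient near .49 and the post-test a coefficient near .10, the same ordering as the full model in Table 5.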

Tolerance statistics indicate that, as one might expect, little of the variance
in reading scores is attributable to the variability in the other academic
performance measures (tolerance = .95). About 12 to 13% of the variability in
aggregated post-test scores and group performance is accounted for by the other
measures (tolerance = .87 and .88 respectively). The variables in this equation do
not appear to depend on one another for their predictive capacity.
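
The tolerance and model statistics in Table 5 can be unpacked with a few lines of arithmetic. The sketch below is a reader's check rather than part of the original analysis: it assumes the usual definitions (tolerance is one minus the R² of a predictor regressed on the other predictors, and adjusted R² is computed with n = 39 groups and k = 3 predictors), and the helper names are hypothetical.

```python
N_GROUPS = 39       # groups in the regression (Table 5)
K_PREDICTORS = 3    # post-test, reading percentile, group performance

# Tolerance values as reported in Table 5.
tolerances = {
    "Aggregated Post-test Percent": 0.87,
    "Aggregated Reading Percentile": 0.95,
    "Average Group Performance": 0.88,
}

# (1 - tolerance) is the variance shared with the other predictors;
# 1 / tolerance is the variance inflation factor (VIF).
shared_variance = {name: 1.0 - tol for name, tol in tolerances.items()}
vif = {name: 1.0 / tol for name, tol in tolerances.items()}

def r2_from_adjusted(adj_r2: float, n: int, k: int) -> float:
    """Recover unadjusted R squared from adjusted R squared."""
    return 1.0 - (1.0 - adj_r2) * (n - k - 1) / (n - 1)

def f_from_r2(r2: float, n: int, k: int) -> float:
    """Overall F statistic for a regression with k predictors."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

r2 = r2_from_adjusted(0.24, N_GROUPS, K_PREDICTORS)
f_stat = f_from_r2(r2, N_GROUPS, K_PREDICTORS)

for name in tolerances:
    print(f"{name}: shared = {shared_variance[name]:.0%}, VIF = {vif[name]:.2f}")
print(f"R squared = {r2:.2f}, F({K_PREDICTORS}, {N_GROUPS - K_PREDICTORS - 1}) = {f_stat:.1f}")
```

Under these assumptions the adjusted R² of .24 corresponds to an unadjusted R² of about .30 and an F of about 5.0 on 3 and 35 degrees of freedom, which matches the model line in Table 5. The variance inflation factors all stay close to 1.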

Discussion 

Like the correlational findings, regression analysis indicates that group 
performance scores are a valid measure of academic performance at the group 
level. Group performance includes aspects of both essay performance (such as 
expressing one's own ideas) and multiple-choice test performance (for example, 
recognizing content). A teacher who assigns group grades on a group project can
expect her students to protest that it is "not fair" to assess them as groups rather
than as individuals. Perhaps parents or administrators will echo that sentiment.
These data show that the group measure is as "fair" a measure of academic 
performance as aggregating individual performance. Webb (1995) argues for the 
importance of matching group processes to the goals of the assessment. In this 
case, very close attention has been paid to maintaining the centrality of academic 
content to group processes and outcome measures. 

Further, group performance shows predictive validity toward essay scores 
while aggregated multiple-choice test and standardized reading scores do not. 
The fact that two historically robust academic measures fail to reach significance 
could be interpreted as revealing problems with the academic measures. The 
close attention paid to specific historical content in the curriculum, group 
performance assessment, and tests argues against this interpretation. In my 
opinion, the more reasonable explanation is that repeated exposure to the
concepts in group performance activities has a more direct influence on students'
ability to later recall and use those same concepts in an essay test.

I hold that group performance, rather than individual performance scores
aggregated to the group level, is the more proximate measure when assessing
groups. Empirically, group performance is a better predictor of essay
performance than is post-test score.

I find the results of this analysis to lend credence to the argument that a
group transcends its constituent parts. I remain hesitant to use the term "more"
in describing that transcendence. In academic settings, "more" must attach to
improvements in academic performance. These analyses indicate that better
group performance significantly predicts later individual academic
achievement. Individual performance, aggregated to the group level,
does not show the same result. These findings support my contention that group
level performance measures are a better way to measure groups than
aggregating individual measures to the group level.

The findings reported here also indicate that aggregating individual 
performance does work as a technique to measure group academic performance. 
While none of the outcome measures used could be said to capture all of 
academic performance, each of the measures tested reflects some aspects of that 
performance and can legitimately be used to measure academic outcomes. 

Implications 

This study confirms that group level analysis can be done successfully for
conceptual academic content. While the difficulty of conducting academically
based group level analyses in school settings may have contributed to their being
perceived as illegitimate in the past, this study supports the idea that group level
analyses can and should be done.

Establishing the measurability of group performance has methodological
as well as practical implications. Sociological researchers outside schools
routinely use aggregations of individual contributions as a measure of group
performance. 3 In those settings, a research and development team for example,
the researcher does not use the group product as an outcome measure, though it
may be of critical importance to the organization. Standards such as
marketability are used as the sole judge of the group's performance rather than an
assessment based on the characteristic qualities of the product the group was
charged with creating (for example, cost, manufacturing, functionality, appeal,
and availability of raw materials). This work establishes the feasibility of using a
true group product as a measure of a true group task in schools.

This study also offers a set of tools: the rubrics for judging
academic performance at the group level. These rubrics show practitioners how
to maintain a content focus in assessing groups, and establish that it can be done
successfully for conceptual academic content. The rubrics can act as models to
the teacher for explicit statements about performance that are not a recipe-like
reduction of the assignment.

When faced with group grades for group work, students everywhere cry 
"It's not fair!" As a teacher, I knew that group assessments could be as fair as any 
other type of assessment. Now I "know" that as a researcher, even using the 
word as advisedly as I now do. Educational researchers routinely aggregate 
individual scores to measure group performance. This study has shown that 
group scores can say as much as aggregated individual scores. While the 
researcher in me waits for the finding to be replicated, the teacher in me 
celebrates having an answer to a persistent and touchy question. I will celebrate 
even more when, and if, future work supports my intuition that a group is 
greater than the sum of its parts and that group measures can appropriately 
capture the contributions of all to what none could do alone. 

Potentially productive research could grow out of this work in the area of 
the measurement of conceptual academic content at the group level. 

3 I am indebted to B. Cohen for this point.

Parallel to the argument that a theory needs to be tested in a variety of
contexts to gain legitimacy is the point that these instruments should be applied 
to a variety of subject matters to establish their usefulness outside the realm of 
history. Academic disciplines vary in their content and in the pedagogy used to 
communicate that content - compare the science class lab experiment to the 
English class analysis of a sonnet. My performance rubrics reflect the standards 
important in a history class. Further testing would reveal if the same technique 
holds utility in a foreign language or mathematics class. 

My ability to measure academic performance at the group level does not 
mean that large numbers of teachers could do the same. Further work needs to 
be done to establish which aspects of this work hold the greatest utility for 
classroom teachers. In addition, one must acknowledge the importance of
preparing teachers to understand and implement such measures in their own
classrooms. Further, teachers may need support assessing conceptual content, as
opposed to the more usual factual content. This study does not provide methods 
for putting these tools into the hands of classroom teachers. 

Conclusion 

Teaching is a complex endeavor. Teachers who take on added 
complexities, such as group activities or teaching conceptual content in addition 
to facts, need support. It is an unfortunate reality that academic content is not 
always central to the academic performances required of students or the 
assessments that they face. Tools for looking at interaction are necessary if we of 
the educational community want students to go beyond being individuals, 
seated alone at their desks. If we want students to have the opportunity to learn 
deep conceptual content, which we have often maintained is best done in groups, 
we must provide tools for teachers and students to use. Neither group
assessment nor the study of concepts as well as facts has received sufficient
development to make it an institution in education. This is a start.
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Group Performance Scoring Sheet 

The Afterlife in Ancient Egypt, Activity 3: Tombs, Houses of Eternity

Group Product Rubric

Activity Three - Tombs, Houses of Eternity

Group Presentation Rubric

Activity Three - Tombs, Houses of Eternity

Group Performance Scoring Sheet
The Afterlife in Ancient Egypt, Activity 4: I Want My Mummy

Note that scores given for the presentation must equal or exceed the scores given for the product.
Though the presentation score will at least equal the product score, doing both presentation and
product does not guarantee a higher score than doing the product alone.
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The Importance of the Afterlife in Ancient Egypt 
Activity 3: Tombs - Houses of Eternity 

The Ancient Egyptians believed that tombs served as homes in the 
afterlife. Egyptians built two types of tombs for their kings. A tomb was 
either part of a large and obvious structure like the pyramids, or it was 
hidden away in a hard-to-find place. All of the possessions that were
necessary in the afterlife were stored in the tomb. It was very important 
that the deceased and his possessions be kept safe in the tomb for all 
eternity. 

As a group, read the resource card, examine the pictures, and discuss the 
following questions: 

1. Using the Group Information Organizer, discuss and record
the advantages and disadvantages of pyramid versus 
hidden tombs. 

2. Priests and builders of the secret chambers in tombs 
occasionally stole the treasures. What moral conflicts might 
a priest or a builder face if he knew the location of hidden 
treasures? 

3. What are some of the most important things to consider in 
building a tomb that would ensure a happy afterlife? 

Group Task 

Your group has been selected to design a tomb for the Pharaoh. Prepare 
a presentation for the Pharaoh that includes a recommendation for the 
type of tomb, a picture or 3-D model of what the tomb will look like on 
the inside, and an explanation of your choice. Present your design to 
the class. 



Evaluation Criteria 

• Presentation is convincing. 

• Presentation gives good reasons for the type of tomb chosen. 

• Picture shows that your tomb solves problems that ancient builders worried about. 
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The Importance of the Afterlife in Ancient Egypt 
Activity 3: Tombs-Houses of Eternity 

Individual Report 

Pretend you are an Ancient Pharaoh. Illustrate your idea of a perfect tomb. 
Explain how it would ensure a happy afterlife. 



Evaluation Criteria 

• Answer makes clear what type of tomb you, as Pharaoh, prefer.

• You, as Pharaoh, make at least three points in support of your choice. 
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The Importance of the Afterlife in Ancient Egypt 
Activity 4: I Want My Mummy

The ancient Egyptians had one great wish: to live forever. The 
Egyptians' belief in life after death led to their complex mummification 
process. The Egyptians believed that each soul had two parts: the Ba and 
the Ka. Both the Ba and the Ka were released from the body at the time of 
death. The Ba lived with the family during the day and returned to the 
body at night. The Ka traveled from the body to the other world. In order 
for the Ba and the Ka to return to the body at night, the body had to be 
recognizable. After death, the bodies of pharaohs and nobles were 
mummified to preserve them. Bodies of ordinary people were preserved 
by placing them in the hot, dry sand of the desert. The ancient Egyptians 
believed they would live in their tombs just as they had lived on earth. 

As a group, read the resource card, look at the pictures, and discuss the 
following questions. 

1. How does the practice of mummification tie in with the
ancient Egyptians' beliefs in the Ba and Ka?

2. Describe the mummification process. Why was each step of 
the process so important? 

3. How might some of the amulets pictured on your resource 
card help the deceased on his journey to the afterlife? 

4. Can you see any purpose for preserving the dead in our 
time? Explain why or why not. 

Group Task 

As a group, create a song, rap, or dance in which you describe the 
mummification of an ancient pharaoh. Include details about the steps in 
preparing the body for burial. 



Evaluation Criteria 

• Performance is easy for the class to follow and understand. 

• Song, rap, or dance gives details about the materials and amulets used. 

• Beliefs about Ba and Ka are part of the presentation. 



Scarloss Assessment at the Group Level Appendices 



The Importance of the Afterlife in Ancient Egypt 
Activity 4: I Want My Mummy

Individual Report 

The Egyptians tucked magical amulets in with mummies to protect them in 
their travels to the afterlife. Create a personal amulet that is important to 
you. Explain why it will be important for you in the afterlife. 



Evaluation Criteria 

• Answer gives at least three reasons amulet will contribute to a happy afterlife. 

• Answer shows connection between the purpose of the amulet and its magical powers. 



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